This guidebook is for data analysts working with computer data files
that contain records with incomplete
data. It indicates choices the analyst must make and the criteria for 
making those choices in regard to the following questions: (1) What 
resources are available for performing the imputation? (2) How big is 
the data file? (3) What is the purpose for imputing missing data? (4) 
What structures exist in the recorded variables? (5) What is the 
pattern of missing data? (6) What assumptions are acceptable for this 
imputation? Answers to these questions constitute recommendations for 
imputation procedures. Several alternative recommendations and the 
conditions that determine the appropriateness of use are considered. 
The final section of the guidebook contains instructions for using 
PROC IMPUTE created by the Statistical Analysis Group in Education 
for the National Center for Education Statistics, and for 
interpreting its results. Appendices include: (1) Processing Time by 
Numbers of Variables and PROC IMPUTE; and (2) Sample Statistical 
Analysis System Program to Reweight for Total Nonresponse. (JAZ) 
I. ALTERNATIVE IMPUTATION STRATEGIES



What This Guidebook Is About 

This guidebook is for data analysts who are working with computer data
files that contain records with incomplete data. Specifically, the
guidebook pertains to data from surveys that had some nonresponse. The
guidebook indicates choices the analyst must make and criteria for making
those choices. Because dealing with missing data can be facilitated by
careful design of the survey instrument and data collection, this
guidebook includes useful information for survey designers as well as
data analysts.

The Guidebook for Imputation was prepared as supporting documentation 
for a particular missing data imputation procedure developed under NCES
contract, but as we shall see, the choice of best procedure depends on 
both the contents of the data file and the objectives of the analyst. To 
be more precise, the analyst must address the following six questions to 
decide on an imputation procedure. Each of these will be discussed in 
turn in this guidebook. 

1. What resources are available for performing the imputation?

2. How big is the data file? 

3. What is the purpose for imputing missing data?

4. What structures exist in the recorded variables? 

5. What is the pattern of missing data? 

6. What assumptions are acceptable for the imputation? 

The answers to these questions will constitute recommendations for 
imputation procedures. We shall consider these in turn, and then list a 
series of specific alternative recommendations, indicating the conditions 
that determine the appropriateness of use of each of several alternative 
procedures. The final section of this guidebook contains instructions for
using PROC IMPUTE, created by SAGE for NCES, and for interpreting its 
results. 



Many agencies have done a substantial amount of work recently to 
improve imputation procedures, to which this guidebook only refers in 
terms of general principles and findings. Interested readers who wish to 
pursue alternatives other than the use of standard packages might refer to 
Aziz and Scheuren (1978) and Madow (1979) for compendia of different 
perspectives, models, procedures, and findings. 

There are basically four types of imputation procedures: 

1. superficial methods, such as ignoring missing data, using complete 
cases only, or assigning the mean or modal value for all missing
cases; 

2. weighting methods, in which missing values are implicitly filled 
in by increasing the weights assigned to similar cases that 
responded; 

3. single-valued explicit imputation, in which a response is inserted
into the data file in place of the missing data code; and

4. multi-valued explicit imputation, in which replicate files are 
created with different responses inserted based on different 
underlying imputation models.

Among the weighting methods, there are two major types: those that
incorporate external information about response distributions, such as 
raking ratio estimators (Oh & Scheuren, 1978), and those that rely purely 
on the information contained in the survey data file. Although provisions 
for performing descriptive analyses on weighted data are available in 
standard statistical packages, these packages contain no formal procedures 
for performing the reweighting to deal with nonresponse. This presents no 
great problem in the case of weighting based on information in the survey
file, because the programming to perform the reweighting is quite simple. 
An example of a program for reweighting in the SAS language is shown in 
Appendix B. 
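
As a rough illustration of reweighting based only on information in the
survey file, the sketch below is written in Python rather than SAS and is
not the Appendix B program. It inflates each respondent's weight so that
the respondents in a stratum also represent that stratum's
nonrespondents. The field names `stratum`, `weight`, and `responded` are
hypothetical, and the sketch assumes every stratum has at least one
respondent with a positive weight.

```python
from collections import defaultdict


def reweight_for_nonresponse(cases):
    """Return respondents only, with weights inflated within each stratum.

    `cases` is a list of dicts with the hypothetical keys 'stratum',
    'weight' (base weight), and 'responded' (True or False).
    """
    total = defaultdict(float)       # base weight of all sampled cases
    responding = defaultdict(float)  # base weight of respondents only
    for case in cases:
        total[case["stratum"]] += case["weight"]
        if case["responded"]:
            responding[case["stratum"]] += case["weight"]

    adjusted = []
    for case in cases:
        if not case["responded"]:
            continue  # nonrespondents are represented by respondents
        stratum = case["stratum"]
        factor = total[stratum] / responding[stratum]
        adjusted.append({**case, "weight": case["weight"] * factor})
    return adjusted


if __name__ == "__main__":
    sample = [
        {"stratum": "A", "weight": 10.0, "responded": True},
        {"stratum": "A", "weight": 10.0, "responded": False},
        {"stratum": "B", "weight": 5.0, "responded": True},
    ]
    for row in reweight_for_nonresponse(sample):
        print(row["stratum"], row["weight"])
```

Within each stratum the adjusted weights of respondents sum to the base
weights of all sampled cases, so weighted totals are preserved.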

Among the single-valued explicit imputation methods, there are three
alternative categories: 

a. synthetic estimates, such as regression function values; 

b. "hot deck" estimates, which assign a response taken from some
other case on the file; and 



c. distributional estimates, which assign a response randomly from an 
appropriately selected distribution. 

Of these, the hot deck methods have received the most attention recently.
Synthetic estimates are available, however, in the BMDP system (Dixon &
Brown, 1979), while the other two, "hot deck" and distributional
estimates, have not been disseminated in common statistical packages. The
procedure described in detail in this guidebook, PROC IMPUTE, is a
distributional estimation method, embedded in the SAS package (Helwig &
Council, 1979) for easy access.

Multi-valued explicit imputation, proposed by Rubin (1978), consists
of imputing values several times, using different models of nonresponse
and different random numbers, to create several copies of the file of
data. Variance in the results of analyses among these files then provides
an estimate of "error due to imputation." It has not been widely used
because of its unpleasant requirement that all users of imputed data
repeat all their analyses several times. This method may yet prove to
be necessary, however.
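
The idea of using variation among replicate files can be sketched as
follows. This is a minimal Python illustration, not a procedure from
Rubin (1978) or from any package: it assumes the analyst has already
computed the same statistic on each replicate file and simply reports the
average and the variance among replicates. The function name and the
example numbers are hypothetical.

```python
from statistics import mean, variance


def imputation_spread(replicate_estimates):
    """Summarize one statistic computed on each replicate (imputed) file.

    Returns (average estimate, sample variance among replicates). The
    variance among replicates is what the text calls an estimate of
    "error due to imputation"; it needs at least two replicates.
    """
    if len(replicate_estimates) < 2:
        raise ValueError("need at least two replicate files")
    return mean(replicate_estimates), variance(replicate_estimates)


if __name__ == "__main__":
    # Hypothetical example: a mean computed on three replicate files.
    average, spread = imputation_spread([10.0, 12.0, 11.0])
    print(average, spread)
```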

To decide among these methods, and to decide how best to plan ahead 
for imputation, survey designers and data analysts must consider the six 
questions stated above. We discuss each in turn. 

(1) What resources are available for performing the imputation?

Imputation of missing data according to statistical models may require 
a complex computer program or a simple one, depending on the method used; 
unless a packaged procedure is available, writing programs for implementing
the complex methods will require both a substantial programming effort
and a clear understanding of the types of bias that imputation procedures 
can introduce. 

There are three major statistical packages for handling survey data:
BMDP, SPSS (Nie et al., 1975), and SAS. Numerous other packages are
available at particular computer centers, and analysts should be familiar
with provisions, if any, for imputing missing data at the computer centers
they commonly use. In BMDP, there is a program, BMDPAM, that is very easy
to use and has five alternative methods: setting values to the mean, plus
four regression estimates (using one variable, using two variables, using
all available variables achieving statistical significance, and using all
available variables). In SPSS, there is little that can currently be done
with missing data. The regression and factor analysis routines in SPSS
do, however, provide superficial methods for dealing with missing data in
calculating residuals and factor scores. In SAS, a procedure, PROC
IMPUTE, has been developed by SAGE under contract to NCES that is very
easy to use and at present has two alternative methods, regression
subsetting and simple regression.

The cost of running either the BMDPAM or the SAS PROC IMPUTE program 
on a data file is on the order of magnitude of performing regression 
analyses on the file. The typical cost of runs on 20 variable files with 
1000 cases on the NIH Computer Center IBM 370-168 system has been on the 
order of $10. With this guidebook (or with the BMDP manual), a programmer
with SAS (or BMDP) should be able to set up a run within an hour. 

(2) How big is the data file?

A survey data file has two dimensions of size: the number of variables 
and the number of cases. Each has substantial effects on the cost of 
imputation of missing data as well as on most other analyses. The number 
of cases affects the computer time required, and the number of variables 
affects both the time and storage required. Because imputation by PROC 
IMPUTE requires three passes through the file, compared to two passes for 
many other methods, it may be less attractive in its present form* for 
very large files (e.g., over 50,000 cases). Costs increase linearly with 
number of cases. For any method of imputation that makes use of relations 
among variables, the costs increase more than quadratically with the 
number of variables, however. If the file to be analyzed contains more
than 80 variables or so, it is advisable to impute variables in blocks of
50 to 80 each to limit costs. PROC IMPUTE can be called repeatedly on a
file, with new variable lists, with no difficulty, so the only problem is
to select blocks of variables appropriately. The recommended approach is
to include variables that are highly related to each other in the same
block. These relations can be determined either logically or on the basis
of correlations. As a practical example, if imputation is performed on a
file that is the merger of several years' surveys, then all years' values
for any particular variable should be included in the same block because
they will be highly related. To capture relations between blocks,
variable lists for successive blocks after the first should include key
variables from earlier blocks for use in regression estimates.

*All but the final pass through the data are for the purpose of parameter
estimation, however, and could be run on a sample from very large files.

From a theoretical perspective, it is also important to limit the 
number of variables in each block to a small fraction of the number of 
cases on the file (or to be more precise, the number of cases with data) 
to provide for stable estimation of parameters used in the imputation. 
The number of parameters estimated for use in imputation increases with
the number of variables in each block. The number of parameters to be
estimated can also be controlled by varying the coarseness or fineness of
the imputation. PROC IMPUTE uses information about the size of the file
obtained in the first pass through the data to determine the appropriate
number of parameters (the fineness of the imputation) to estimate in the
second pass through the data.

(3) What is the purpose for imputing missing data?

Imputation should be considered as but a step in a general plan for 
making use of survey data. It follows editing of the data, which should
remove clearly spurious values from the file, so that they are not 
perpetuated by imputation and later analytical procedures. The selection 
of alternative imputation procedures depends on the uses to which the data 
are to be put. Several alternative purposes for imputation are shown in 
Table 1. 
Table 1

PURPOSES FOR IMPUTATION
(uses of data files)

1. To estimate population totals:
   imputation is fairly easy.

2. To estimate relations among measures:
   imputation must be sophisticated.

3. To test a complex set of hypotheses:
   imputation must be sophisticated.

4. To produce a "public use" file:
   imputation must be sophisticated;
   particular units should not be identified.

5. To measure particular units (e.g., for audits):
   imputation is not appropriate unless highly accurate.



First, if the purpose is merely to estimate population means or
totals, various methods work nearly equally well. Cases may be reweighted
within strata, a simple "hot deck" procedure (within strata) can be used,
or linear regression estimates will suffice. Linear regression estimates
are available in the BMDP package as well as in PROC IMPUTE. To the
extent that the distributions of respondents and nonrespondents overlap,
these methods will produce accurate estimates (subject to assumptions
described in answer to question #6). In fact, for this purpose, it is not
even necessary to impute actual values; direct "macro-imputation" of
totals based on summaries of relations between the presence of a variable
and the values of other variables will suffice. (The term
macro-imputation is used to refer to methods that can be implemented using
only file summary data without requiring any additional examination of
individual records on the file.)

To estimate relations among variables or to test complex hypotheses,
the second and third purposes in Table 1, a more sophisticated method of
imputation is necessary. This is the most common use of survey data in
report generation. Relations may be presented as correlation
coefficients, as graphs relating measures, as bivariate frequency tables,
or as tables of means in different strata. The testing of complex
hypotheses may go further to examine the factor structure of a set of
measures or to compare mean differences to error estimates. In all these
cases, imputation must not unduly distort the distributions of variables.
Preservation of the multivariate distribution of variables is a problem
not considered by most statisticians who are studying missing data
imputation; it is, however, a primary goal of the development of PROC
IMPUTE.

In particular, variances and covariances, as well as means, must be
accurately reproduced in order to provide an analyzable file. Assignment
of mean values, or even linear regression estimates, substantially reduces
the variances of imputed variables; this problem is overcome, however, by
procedures that assign values from distributions, such as "hot deck"
procedures, and procedures that assign values randomly as distributed
estimates, such as PROC IMPUTE. To preserve correlations among variables,
it is important to avoid imputing variables independently from each
other. This is accomplished automatically by case reweighting methods and
"hot deck" procedures that replace whole cases. Methods that impute
variables one-by-one must use imputed values for predictor variables in
imputing other variables in order to preserve correlations. Although it
might appear that using imputed values to impute other values only builds
error on error, the contrary is true when the purpose is to reproduce the
multivariate structure of a data file rather than to make the best guess
for each individual case.
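
The variance reduction caused by mean assignment can be seen in a small
numerical sketch. The Python below is illustrative only and is not taken
from BMDPAM or PROC IMPUTE: `impute_mean` fills every missing value with
the observed mean, while `impute_random_draw` fills each missing value
with a randomly chosen observed value, a simple stand-in for the
distribution-based methods described above. Both function names and the
example data are hypothetical.

```python
import random
from statistics import mean, pvariance


def impute_mean(values):
    """Replace each missing value (None) with the mean of observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_random_draw(values, rng):
    """Replace each missing value (None) with a random observed value."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


if __name__ == "__main__":
    data = [2.0, 4.0, 6.0, 8.0, None, None, None, None]
    observed = [v for v in data if v is not None]
    print("observed variance:    ", pvariance(observed))
    print("after mean imputation:", pvariance(impute_mean(data)))
    drawn = impute_random_draw(data, random.Random(0))
    print("after random draws:   ", pvariance(drawn))
```

With half of the values missing, mean assignment cuts the variance of
this variable in half. Random draws vary from run to run but are not
systematically pulled toward the mean.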

PROC IMPUTE, unlike BMDPAM, assigns values as distributed random
variables and uses imputed values in imputing other variables, and as a 
result it generally reproduces variances and correlations more accurately, 
although it reproduces individual values less accurately.* 

If the purpose of imputation is to produce a "public use" file, the 
most sophisticated methods should be used. Because the analyses performed 
on a public use file cannot be predicted, tests of the validity of 
imputation (e.g., based on telephone follow-ups) are important to ensure 
that results of future analyses do not reflect imputation. Moreover, the 
method used should allow for easy estimation of the errors introduced when 
imputed values are included in subsequent analyses. Rubin (1978) has 
recommended producing replicate-files with different imputations so that 
users can perform replications of analyses to estimate the effects of 
variation in imputation. As he pointed out, imputed values will differ 
from recorded values both due to random error and due to errors in the
assumptions underlying the model. By including explicit random error 
distributions in its calculations, PROC IMPUTE allows direct estimation of 
the random error component. This is described in Section II of the 
guidebook. 



*This tradeoff appears to be unavoidable. The "SIMPLE" option in PROC 
IMPUTE allows it to mimic the performance of BMDPAM in reproducing 
individual values rather than variances and correlations. 
Because a trade-off exists between reproducing individual values and
reproducing distributions, one must frequently be sacrificed if the other
is to be optimized.* Uses that require accuracy of individual values are
those in which some future action is anticipated with respect to
particular cases, such as stratified sampling for a future survey.

Because of the impossibility of complete elimination of error in
individual cases, we recommend that imputed values not be used for purposes
involving identification of individual cases. Imputation can then focus
on reproducing distributions.

(4) What structures exist in the recorded values?

Most surveys have internal logical structures, or redundancies, such as
blanks for male, female, and total counts of staff. Imputation can be, but
should not be, undertaken blindly without cognizance of these structures.
Whenever possible, constrained missing values should be filled in as a
part of editing prior to imputation, to simplify the imputation task. For
example, if male and female counts are present but the total is missing,
the best method of filling in the total is obvious, but it will be
different from the best method for use when all three counts are missing.
The best method in the latter case might involve first estimating the
total, then the components.
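
The editing step for a constrained triple such as male, female, and total
counts might be sketched as follows. This Python fragment is an
illustration under simple assumptions (missing values are coded as
`None`, and the recorded counts have already passed consistency checks);
the function name is hypothetical. It fills a value only when exactly one
of the three is missing and otherwise leaves the case for imputation.

```python
def fill_constrained_counts(male, female, total):
    """Fill the single missing count implied by male + female = total.

    Missing values are None. If zero, two, or three values are missing,
    nothing can be deduced and the inputs are returned unchanged.
    """
    if [male, female, total].count(None) != 1:
        return male, female, total
    if total is None:
        total = male + female
    elif male is None:
        male = total - female
    else:
        female = total - male
    return male, female, total


if __name__ == "__main__":
    print(fill_constrained_counts(12, 8, None))      # total deduced
    print(fill_constrained_counts(None, 8, 20))      # male deduced
    print(fill_constrained_counts(None, None, 20))   # left for imputation
```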

Analysts should, when possible, construct derived variables that 
indicate characteristics of the cases better than the basic survey 
response variables, such as teacher/pupil ratios for schools. Adding 
these variables to the file will increase the accuracy of imputation as 
well as of other analyses. On the other hand, to avoid bias, it is 
important not to impute values of variables ultimately to be used in 
analysis as nonlinear functions of other variables. For example, if one
imputes counts of teachers by multiplying imputed teacher/pupil ratios by
counts of students, the resulting distribution of counts of teachers will
be biased. Derived variables should be used as linear predictors in order
not to introduce bias.

*Imputing the appropriate modal value for all missing cases is optimal for
the purpose of individual matching, but this will bias nearly all analyses.

Imputation will obviously be more accurate when closer relations exist 
among variables present and those missing. Therefore, (1) if imputation
must be done in blocks of variables, highly correlated variables should be
included in the same block; and (2) if a high proportion of nonresponse to 
a particular item is expected for certain survey strata, then another item 
or items highly correlated with the target item but more likely to produce 
responses should be included in the survey instrument. An example of 
inclusion of a highly correlated simple variable would be a request for 
grades served by a school in addition to a grade-by-grade breakdown of 
enrollment. 

(5) What is the pattern of missing data? 

Six common patterns of missing data are shown in Figure 1. The 
recommendations for imputation vary between them. If data are missing 
randomly from the file, then imputation is only for convenience. Sta- 
tistical computations based on the incomplete data file will, by defini- 
tion, produce the same results that would have occurred had data not been 
missing, although the effective sample sizes are smaller. This situation 
is so rare that it need not be considered: respondents do differ from 
nonrespondents. For random variation, PROC IMPUTE is to be preferred 
over, for example, filling in mean values, because it reproduces distributions.
For the case of attrition, when only a small amount of information
(such as stratification-variable values) is known about nonrespondents, 
weighting is at least as good as other imputation methods, especially if 
the data file is already weighted, because no new complexities are 
introduced into the analyses. This situation is typical of one-time 
household surveys, where only the location of nonrespondents is known. 
When some variable is missing for all cases or is present for so few cases
that stable parameters of its distribution cannot be obtained, then no
imputation method is appropriate and the variable must be dropped from
further analysis. This may occur, for example, when an intermediate
aggregation agent, such as a State Education Agency, decides that no
information on a particular measure should be reported for any school in
the state.

Figure 1. Patterns of missing data. Data files are represented as
rectangular matrices, with variables considered as columns and cases
considered as rows. Missing data are represented by diagonally filled
areas. [Figure not reproduced; legible panel titles: MATRIX SAMPLING,
DOUBLE SAMPLING, POPULATION UNDEFINED.]

The most common situation is one in which different blocks of
variables are missing for cases of different "types." The types may be
determined by the survey designer, for example, by following up
nonrespondents using a shortened form of the survey instrument (e.g., a
telephone follow-up of a mailed survey). They may also be determined by
the respondents: respondents with particular characteristics may tend not
to respond to certain items. This is the situation for which BMDPAM and
PROC IMPUTE are most clearly useful. Weighting is an inefficient form of
imputation in this situation because separate weights must be obtained for
each variable.*

One other important pattern of missing data is unknown undercoverage
of the universe. This will occur when the survey involves defining the
universe as a combination of lists from numerous sources. One cannot
always be sure that a sufficient set of sources has been checked to
identify all members of a universe. If no information is known about
nonrespondents, including their very existence, then no imputation method
based on the survey data file alone is meaningful. An external source of
data, known to represent the entire population, can be used, however, to
impute missing values. This is commonly done by reweighting survey
respondents so that their distributions on key variables match the
distributions obtained from external sources (e.g., Oh and Scheuren, 1978).
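
Reweighting to match external distributions can be sketched with
iterative proportional fitting, one common form of raking. The Python
below is a minimal illustration and is not claimed to reproduce the Oh
and Scheuren estimators. The names `rake`, `cases`, and `margins` are
hypothetical, and the sketch assumes that every category present in the
file appears in the external margins, that all starting weights are
positive, and that the external margins for different variables sum to
the same total.

```python
def rake(cases, margins, iterations=50):
    """Adjust case weights so weighted margins match external totals.

    cases:   list of dicts, each with a 'weight' and categorical values.
    margins: {variable: {category: external population total}}.
    Returns the list of adjusted weights, in the order of `cases`.
    """
    weights = [case["weight"] for case in cases]
    for _ in range(iterations):
        for variable, targets in margins.items():
            # Current weighted total in each category of this variable.
            current = {}
            for case, w in zip(cases, weights):
                key = case[variable]
                current[key] = current.get(key, 0.0) + w
            # Scale each case so its category hits the external total.
            weights = [
                w * targets[case[variable]] / current[case[variable]]
                for case, w in zip(cases, weights)
            ]
    return weights


if __name__ == "__main__":
    respondents = [
        {"region": "N", "level": "elem", "weight": 1.0},
        {"region": "N", "level": "sec", "weight": 1.0},
        {"region": "S", "level": "elem", "weight": 1.0},
        {"region": "S", "level": "sec", "weight": 1.0},
    ]
    external = {
        "region": {"N": 60.0, "S": 40.0},
        "level": {"elem": 70.0, "sec": 30.0},
    }
    print(rake(respondents, external))
```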



*Cox and Folsom (1979) have proposed a method of variable by variable 
imputation that is mathematically equivalent to reweighting, but this 
method does not preserve relations among imputed variables.
(6) What assumptions are acceptable for the imputation?

Every imputation method is based on a model of nonrespondents, a set of
assumptions about what their responses would have been. Statistical
analyses are only meaningful in terms of these models, so the model must
be made explicit for any successful imputation procedure. All the models
underlying methods that do not rely on external data are of the form:
nonrespondents and respondents are alike, once particular differences are
accounted for. The various methods differ in what types of differences
they take into account, as shown in Table 2.

The assumption that relations among variables are constant is basic to 
nearly every imputation method. This is made explicit in regression-type 
methods, such as used by BMDPAM and PROC IMPUTE, but it is also present in 
all stratification weighting schemes and in hot-deck procedures that 
assign by strata or according to a nonrandom ordering of the file. The
validity of the assumption of constant relations cannot be directly tested 
in practice, because data are not available on nonrespondents. An 
approximation can be obtained, however, by comparing relations across 
strata of respondents that differ in ways similar to respondent- 
nonrespondent differences. 

A logical basis exists for the assumption that relations are constant 
even though respondents and nonrespondents may be quite different in level 
and variability of characteristics, and evidence exists to support this 
assumption. Further exploration of the assumption and the conditions
under which it is satisfied is needed, however. The logic behind the
assumption is that an observed relation is an observed invariance. Two 
variables cannot be highly correlated unless there is a combination of 
these variables that is nearly constant across the range of observations 
(e.g., counts of teachers and pupils are highly correlated because schools 
hold teacher/pupil ratios relatively invariant). For measures of level or 
variability, no similar invariance exists (other than finding that 
variability is near zero). Empirical results based on Project TALENT's
special follow-up of 10,000 nonrespondents to its survey of 29-year-olds 
eleven years after high school graduation also support the assumption. 
Table 2

MODELS (ASSUMPTIONS)

"Respondents and nonrespondents are alike, once you take
into account . . ."

1. No differences (Y_RESP = Y_NONRESP).
   Imputation is only for convenience.

2. Y = f(X) is independent of response/nonresponse.
   a. Y is a point value or a distribution.
   b. f is an explicit function or a search procedure.
   c. X is one, two, a few, or many dimensional.
   d. X is the same for all Y's or different.

3. Y = f(X) depends on whether the case is a respondent or
   nonrespondent (for variable Y).
   Imputation is nearly impossible at present.

4. The distribution of Y is known externally.
   Imputation by "raking," or reweighting.

Y denotes the distribution of values of a target variable to be imputed,
X denotes the vector of other variables that provide information about Y,
and f is the function relating X to Y.
Whereas nonrespondents differed substantially from respondents on measures
obtained in high school and on later survey items, they did not differ
significantly with respect to important relations. For example, although
respondents had higher academic aptitude scores than nonrespondents and
although the distribution of occupations differed between respondents and
(followed-up) nonrespondents, the difference in academic aptitude between
respondents and nonrespondents was generally the same across occupations.

To summarize, on the basis of these six questions, survey designers
and data analysts should follow the flow diagram shown in Figure 2 in
planning for, executing, interpreting, and using the results of imputation
of missing data. Imputation must be planned prior to data collection.
The most important consideration is to take steps to minimize
nonresponse. For example, the survey instrument should be carefully
pretested and edited; a sufficient rationale should be developed to
convince individuals to respond, including letters of support from
authorities; and a human relationship between the respondent and the
person responsible for data collection should be established. In addition
to minimizing nonresponse and planning for follow-up of nonrespondents,
survey designers should search for related data to assist imputation. For
example, Census data can be used to characterize the types of children
attending a school district that fails to respond to an item on a survey
instrument.

The flow diagram for imputation after data collection has three main
paths, and we are primarily concerned with the choice to use PROC IMPUTE,
the most common case. A key step in this process is the examination of
the results of PROC IMPUTE to determine whether the imputation was
sufficiently likely to be accurate. There are basically three conditions
in which imputation can be adequate, in terms of matching distributions.
First, if only a small amount of data is missing for a variable,
imputation is not likely to affect analyses involving that variable
greatly. Second, even if a large amount of data is missing for a
variable, the imputation can be considered adequate if there is a strong
relationship between the
variable and other measures on the file. As described in Section II, one
report generated by PROC IMPUTE contains estimates of the strength of
relations used for imputation. Finally, even if there is a reasonably
large amount of missing data (e.g., 50%) and a fairly weak relationship
(e.g., r^2 = .25), the imputation may adequately reproduce distributions
if respondents do not differ from nonrespondents. One report generated by
PROC IMPUTE displays the differences (on all other variables) between
cases with a particular variable present or missing. Further work will be
necessary to determine the appropriate combinations of these three
conditions to use in making final decisions concerning acceptance or
rejection of a particular variable's imputation.

Figure 2
Flow Diagram for Imputation

[Diagram reconstructed as an outline from a damaged scan. Text cut off at
the right margin is marked [...]; branch labels that could not be read are
omitted.]

(Prior to data collection)

  1. Include simple items correlated with items for which nonresponse is
     expected.
  2. Take steps to minimize nonresponse.
  3. Select a sample of cases for intensive follow-up, pending nonresponse.
  4. Create a skeleton data file.
  5. Merge external information onto the skeleton file.

(After data collection and editing)

  Is the purpose of imputation to match distributions or individual cases?
    individual cases: Do not im[...]
    distributions:    Decide whether to use PROC IMPUTE.

  Use PROC IMPUTE if
    a) the file contains fewer than 50,000 cases, or
    b) analysis funds are not severely limited, or
    c) the purpose of imputation is more than just a one-time estimation
       of univariate statistics and more than 5% of cases for a key
       variable are missing.

  If PROC IMPUTE is not used:
    Do missing cases differ from others on available measures?
      yes: Use macro-[...] regress[...] analysi[...]
      no:  Are estimates of totals required?
             yes: Use macro-[...] mean substit[...]
             no:  Ignore missing data.

  If PROC IMPUTE is used:
    Are more than 100 variables to be imputed?
      yes: Divide variables into blocks of 50-100, so that highly
           correlated variables are in the same block.
    Are the cases weighted?
    Are there a large number of cases with no data?
      yes: Delete these cases temporarily.
    Run PROC IMPUTE.
    Examine results and delete variables with excessively poor data base
      (too high % missing, too low r^2, too much nonresponse bias).
    Were cases deleted temporarily?
      yes: Weight cases within strata to represent the deleted
           nonrespondents.
    Is the universe defined acceptably?
      no:  Weight cases to match externally produced distributions.
II. THE NEW NCES ALGORITHM: PROC IMPUTE



After reviewing the options for imputation available in the common 
statistical packages, it was determined that something more was needed. 
The BMDPAM program in the BMDP series can satisfy a limited range of 
imputation needs, but the strong bias in variance and covariance estimates
generated from the values imputed by BMDPAM left much to be desired. 
Other special purpose programs are not readily available and are ineffi- 
cient to use because the user must devote considerable time to defining 
input and output formats and other parameters describing the data. Since 
this effort is already included in the use of the statistical packages for
analyses, no extra effort is needed if an imputation procedure can be 
included within one of these packages. 

It was decided to implement a new routine for missing data imputation
in the Statistical Analysis System (SAS) because of the ease of
implementing new routines in SAS, the great flexibility of this system for
data manipulation, and the high level of use of this system. The use of
SAS has increased dramatically over the past two years and now surpasses
the use of SPSS or BMDP at most installations where it is available. (See
recent NIH computer facility usage statistics, for example.)

The procedure implemented, PROC IMPUTE, is a distributional estimation 
procedure that is believed to be more general and to produce more accurate 
results than a standard "hot deck" procedure. Basically, this procedure
considers each variable on the file in turn as a "target" variable whose
missing values are to be filled in and uses information on other variables
to minimize the error in imputing each target variable. For each "target"
variable, regression analysis is used to find the best combination of
predictors, and cases with the target variable present are divided into
subsets based on values of the regression function. All cases in a given
subset that are missing the target variable then have values assigned with
random frequencies proportional to the distribution of reported values for
that variable within the subset. The basic assumption of this algorithm
is that within these homogeneous subsets, the missing value cases will have
the same target value distribution as the cases with reported values on
the target variable.

The following sections describe the PROC IMPUTE procedure more
explicitly. The next section describes the algorithm in more detail. 
This is followed by sections that describe the steps necessary to run PROC 
IMPUTE and how to interpret the results of PROC IMPUTE. Time requirement 
estimates are given in Appendix A. 

How PROC IMPUTE Works 

PROC IMPUTE makes three passes through the input data file. The 
processing that occurs during and between each of these passes is des- 
cribed here in general terms to document the statistical algorithm. The 
specific input statements needed to run PROC IMPUTE and the output 
generated by PROC IMPUTE are described in later sections. 

During the first pass through the data, basic univariate and bivariate 
statistics are computed. These include the mean, standard deviation, 
minimum, maximum, and number of missing values for each variable, the 
intercorrelations among the variables, and the number of cases missing one 
variable but not the other for each pair of variables (as well as pairwise 
means and standard deviations). Reports 1 through 3, described later, 
print out this basic information for the user. 
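
The bookkeeping of this first pass can be illustrated with a short Python
sketch. It is not the PROC IMPUTE code. The function names (first_pass,
pearson), the use of None to mark a missing value, and the sample records
are assumptions made for the illustration, and the pairwise means and
standard deviations are left out for brevity.

```python
import math
from itertools import combinations


def pearson(pairs):
    """Pearson correlation of (x, y) pairs; None if it is undefined."""
    n = len(pairs)
    if n < 2:
        return None
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def first_pass(records, names):
    """Univariate and pairwise statistics; None marks a missing value."""
    univariate = {}
    for v in names:
        vals = [r[v] for r in records if r[v] is not None]
        n = len(vals)
        mean = sum(vals) / n if n else None
        sd = (math.sqrt(sum((x - mean) ** 2 for x in vals) / (n - 1))
              if n > 1 else None)
        univariate[v] = {
            "n_missing": len(records) - n,
            "mean": mean,
            "sd": sd,
            "min": min(vals) if n else None,
            "max": max(vals) if n else None,
        }
    pairwise = {}
    for a, b in combinations(names, 2):
        both = [(r[a], r[b]) for r in records
                if r[a] is not None and r[b] is not None]
        pairwise[(a, b)] = {
            "n_both": len(both),
            "r": pearson(both),
            # cases missing one variable of the pair but not the other
            "a_missing_only": sum(1 for r in records
                                  if r[a] is None and r[b] is not None),
            "b_missing_only": sum(1 for r in records
                                  if r[b] is None and r[a] is not None),
        }
    return univariate, pairwise


# Hypothetical five-record file with two variables.
records = [
    {"X": 1, "Y": 2}, {"X": 2, "Y": 4}, {"X": 3, "Y": None},
    {"X": None, "Y": 8}, {"X": 5, "Y": 10},
]
univariate, pairwise = first_pass(records, ["X", "Y"])
```

Correlations are computed pairwise, from the cases that report both
variables, so different correlations may rest on different cases.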

Following the first pass through the data, stepwise regression 
analyses are performed "simultaneously" for each variable to be imputed. 
During these analyses, an ordered list of the imputation variables is 
constructed, and the regression analysis for each variable is limited to 
predictors that "precede" the target variable in the imputation list. The 
determination of the optimal ordering is a complex procedure based on 
relative amounts of missing data and the relative strengths of relations 
among variables. Initially no restrictions are imposed. Then, at each
step, one predictor variable is added to one regression equation and
additional restrictions are imposed by the fact that the new predictor is
forced to "precede" the target variable. The predictor-target pair
selected at each step is that pair that will provide the greatest incre- 
ment in the variance explained for all of the missing values (of all 
target variables). This process terminates when there are no more 
permissible predictors that provide a significant increase in the predic- 
tion of any of the target variables. 
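
The ordering restriction can be pictured with a small Python sketch. It
shows only the greedy choice of predictor-target pairs under the rule that
a predictor must precede its target; it is not the PROC IMPUTE
implementation. The gain function, the squared correlations, the missing
data counts, and the threshold are all hypothetical stand-ins for the
increment in explained variance that the procedure actually evaluates.

```python
def already_precedes(precedes, start, goal):
    """True if `start` is already forced to precede `goal`."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(precedes[node])
    return False


def select_predictors(variables, gain, threshold):
    """Greedily add the predictor-target pair with the largest gain."""
    precedes = {v: set() for v in variables}   # precedes[p]: targets p must precede
    predictors = {v: [] for v in variables}    # predictors chosen for each target
    while True:
        best = None
        for target in variables:
            for pred in variables:
                if pred == target or pred in predictors[target]:
                    continue
                if already_precedes(precedes, target, pred):
                    continue  # target precedes pred, so pred cannot precede target
                g = gain(pred, target, predictors[target])
                if g >= threshold and (best is None or g > best[0]):
                    best = (g, pred, target)
        if best is None:
            return predictors
        _, pred, target = best
        predictors[target].append(pred)
        precedes[pred].add(target)


# Hypothetical inputs: squared correlations and numbers of missing values.
R2 = {frozenset("XY"): 0.72, frozenset("XZ"): 0.20, frozenset("YZ"): 0.10}
N_MISSING = {"X": 0, "Y": 100, "Z": 50}


def toy_gain(pred, target, current):
    # Assumed form: gain shrinks by half for each predictor already in the
    # equation and is weighted by the number of missing target values.
    return N_MISSING[target] * R2[frozenset((pred, target))] * 0.5 ** len(current)


chosen = select_predictors(["X", "Y", "Z"], toy_gain, threshold=1.0)
```

In this toy run Z is used to predict Y, so Y is barred from predicting Z.
Restrictions of this kind are why the restricted equations may fall short
of the best possible prediction.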

The predictions derived from these restricted regression equations may
not be optimal. If variables X and Y are closely related, each should be
used in the imputation of the other when possible. To allow for this,
some variables must be imputed twice, considering the first imputation as
a "ghost imputation" to be replaced later. Once the initial imputation
list and the associated regression equations have been constructed, the
imputation target variables are each reexamined (in their order in the
imputation list). Additional regression equations are generated whenever
the addition of "follower" variables would significantly improve the
prediction.

Finally, for each regression equation, a number of subsets are defined
in terms of regression function values. Within each subset, the
distribution of target variable values can be expected to have a much
smaller variance than overall, if the regression equation represents a
strong relation. (The number of subsets is defined in terms of a trade-off
between fine-grainedness and stable parameter estimation. The number will
vary with the expected number of cases with "complete data" for the
regression equation variables.)

During the second pass through the data, regression function values 
are computed for each case and each equation where all the required
variables are present, including the target variable. The complete 
bivariate frequency distributions of the regression function values and 
their associated target variables are estimated by counting the number of 
cases in each regression value subset at each level of the target vari- 
able. Following the second pass, each bivariate frequency distribution is 
converted to separate probability distributions for each regression 
subset. Figure 3 shows an illustration of these separate distributions.

FIGURE 3. Distribution of target variable for each regression-function subset.

During the second pass through the data, the mean regression function
value in each subset is also computed to provide information for interpo- 
lation between the distributions in adjacent regression subsets. 
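
A minimal Python sketch of this second pass follows. It is an illustration
under stated assumptions, not the PROC IMPUTE code: the function name
second_pass is invented, the subset index is taken by truncating the
regression function value (consistent with the Report #5 example, where
the first subset holds values below 1.0), and the five sample cases are
made up.

```python
from collections import Counter, defaultdict


def second_pass(cases, n_subsets):
    """cases: (regression_value, target_value) pairs with both present.

    Returns, for each subset index (0-based), the number of cases, the
    mean regression function value, and the cumulative distribution of
    the target variable as (level, cumulative proportion) pairs.
    """
    counts = defaultdict(Counter)
    reg_sums = defaultdict(float)
    for reg, target in cases:
        # Assumed rule: truncate the regression value to get the subset.
        subset = min(max(int(reg), 0), n_subsets - 1)
        counts[subset][target] += 1
        reg_sums[subset] += reg
    result = {}
    for subset, counter in counts.items():
        n = sum(counter.values())
        running, cdf = 0, []
        for level in sorted(counter):
            running += counter[level]
            cdf.append((level, running / n))
        result[subset] = {"n": n, "mean_reg": reg_sums[subset] / n, "cdf": cdf}
    return result


# Hypothetical cases: (regression function value, reported target value).
cases = [(0.2, 1), (0.8, 1), (0.5, 2), (1.4, 2), (1.6, 3)]
tables = second_pass(cases, n_subsets=2)
```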

In the final pass through the data, missing values are imputed for
each case. For each of the regression equations where a target value is
missing, the regression function value is computed. The appropriate
regression value subset and the adjacent subset are identified. A uniform
pseudorandom variable between 0 and 1 is generated, and a value is
computed for imputation of the target variable for each adjacent subset,
based on the pseudorandom variable. The pseudorandom value is considered
to be a probability, and the point on each cumulative distribution
function (obtained in the second pass through the data) corresponding to
that probability is identified (i.e., the inverse of the cumulative
distribution function is applied to the random variable). If the "SIMPLE"
option is specified, the pseudorandom variable is reset to .5 so that the
median value for the subset is always selected. The imputed values
obtained for the two adjacent subsets are then averaged according to the
distance of the mean regression value in each subset from the regression
value for the case being imputed. This average value is rounded to an
integer if the integer flag is set for the target variable.
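
The imputation step for a single missing value can be sketched in Python
as follows. This is a simplified illustration, not the PROC IMPUTE code.
The function names, the handling of regression values that fall outside
the outermost subset means (the end subset is used alone), and the two
cumulative distributions are assumptions. The subset means .772 and 1.430
are borrowed from the Report #5 example given later in this guidebook.

```python
import math
import random


def inverse_cdf(cdf, u):
    """Smallest level whose cumulative proportion reaches u."""
    for level, cumulative in cdf:
        if u <= cumulative:
            return level
    return cdf[-1][0]


def impute(reg_value, subsets, rng, simple=False, integer=False):
    """subsets: list of (mean_reg, cdf) pairs sorted by mean_reg."""
    # One pseudorandom probability is shared by both adjacent subsets;
    # the SIMPLE option fixes it at .5, which selects the median.
    u = 0.5 if simple else rng.random()
    means = [m for m, _ in subsets]
    if reg_value <= means[0]:          # assumed edge rule
        value = inverse_cdf(subsets[0][1], u)
    elif reg_value >= means[-1]:       # assumed edge rule
        value = inverse_cdf(subsets[-1][1], u)
    else:
        i = max(k for k, m in enumerate(means) if m <= reg_value)
        lo_mean, lo_cdf = subsets[i]
        hi_mean, hi_cdf = subsets[i + 1]
        # Weight on the lower subset falls as reg_value nears the upper mean.
        p = (hi_mean - reg_value) / (hi_mean - lo_mean)
        value = p * inverse_cdf(lo_cdf, u) + (1 - p) * inverse_cdf(hi_cdf, u)
    if integer:
        value = math.floor(value + 0.5)
    return value


# Subset means from the Report #5 example; the distributions are invented.
subsets = [
    (0.772, [(1, 0.8), (2, 1.0)]),
    (1.430, [(1, 0.3), (2, 0.7), (3, 1.0)]),
]
halfway = impute(1.101, subsets, random.Random(0), simple=True)
```

With SIMPLE in effect, the medians of the two subsets (1 and 2 in this
invented example) are averaged with equal weight, because 1.101 lies midway
between the two subset means.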

After all missing values have been imputed for a case, the case is 
written to the output file with all of the missing values filled in.
Missing data flags are also created and set for each variable, with a value
of "I" corresponding to imputed values and a blank value for real values.



How to Use PROC IMPUTE 

To use PROC IMPUTE, you must specify (1) the job control language
(JCL) statements to execute SAS, to specify data sets, and to include the
IMPUTE program in the standard SAS program library and (2) the SAS
statements that call PROC IMPUTE. Figure 4 shows both kinds of statements
for a sample run of PROC IMPUTE at the NIH Computer Facility. (PROC
IMPUTE is currently being installed at the Data Management Center. The
library name and SAS procedure name will vary slightly, but otherwise the
same statements will be required.) The remainder of this section
describes each of the required statements more fully.

FIGURE 4. Sample Job Control Cards for Running PROC IMPUTE on a SAS
System File (at the NIH Computer Center)

    //ACCTINIT JOB (acct,class,time,lines),USERNAME
    // EXEC RUNSAS,REGION=400K
    //LIBRARY  DD DSN=WPG4T00.SAGELIB,UNIT=FILE,VOL=SER=FILE26,
    //            DISP=SHR
    //FT06F001 DD DUMMY
    //OLDFILE  DD DSN=your old file,VOL=SER=your volume number,
    //NEWFILE  DD DSN=your new file,VOL=SER=your volume number,
    //SYSIN    DD *
    TITLE your run of PROC IMPUTE;          (optional)
    PROC IMPUTE DATA=OLDFILE.SASname
                OUT=NEWFILE.SASname;
    VAR (list of variables to be processed. If omitted,
         all numeric variables will be processed);

Job Control Statements 

The JOB statement is the same as for any other run. See Appendix A 
for sample time estimates. 

The EXEC statement uses the normal cataloged procedure for SAS. 

The LIBRARY statement points to the SAGE library containing the 
program for PROC IMPUTE. The cataloged SAS procedure concatenates (adds) 
this library to the standard SAS library. In addition to the DSN (data 
set name), the UNIT (device type), VOLume (specific disc pack), and
DISPosition (SHR for share) must be specified.

The FT06F001 DD (data definition) statement is required by some of the
IMSL (International Mathematical and Statistical Libraries) subroutines
that print warning messages. Since PROC IMPUTE reacts to these warnings
itself, they need not be printed. The example shows how to specify a
"dummy" output file for these warning messages.

The file data definition statements tell PROC IMPUTE the name and 
location of the input and output data files. If only PROC IMPUTE is run,
these will be SAS system files. It is possible, however, to include other 
SAS statements to read and/or write raw data files and perform other 
analyses in the same run. If an output file is not specified, the imputed 
values will only be retained on a temporary file for use in the same run. 
Section 8 of the SAS Manual and the section on DD statements in the IBM 
JCL manual give complete information on the optional and required
parameters associated with the data definition statements.

Finally, the SYSIN statement signals the beginning of the SAS
statements.



SAS Statements 



The PROC IMPUTE statement invokes the imputation procedure and
provides key information for the imputation. SAS parses this statement
using a "free field" format so that column positions do not matter. After
the words "PROC IMPUTE," the parameters may be in any order. The
following clauses may be included:



1. DATA=ddname.SASname points to the input data file. The "ddname"
refers to the label used in the JCL. If omitted, a temporary SAS
file is assumed. The SAS name is the internal data set name used
by SAS. If no input data set is specified, the last data set
created by SAS in this run is processed.

2. OUT=ddname.SASname points to the output data file. As before,
ddname refers to a particular DD statement in the JCL, and
SASname is the internal file name. If no ddname is specified, a
temporary SAS file is created. If no output data set is
specified, a temporary data set is created using the standard SAS
file default names.

3. SIMPLE is an optional keyword. If included, the SIMPLE option is
invoked and the imputed values are all set to the median value of
the target variable in the appropriate regression value subset.
The use of this option is not recommended if there is any chance
that variances and covariances will be analyzed. If this keyword
is omitted, the default option is used and values are imputed
randomly according to the target value distribution for the
appropriate regression value subset.

As with all SAS statements, the PROC IMPUTE statement ends with a
semicolon.

The VAR statement follows the PROC IMPUTE statement (possibly on the 
same line) and specifies the variables to be processed by PROC IMPUTE. If 
this statement is omitted, all numeric variables in the input data set
will be processed. After the keyword VAR, the names of the variables to 
be processed are listed, separated by spaces. Only numeric variables may 
be included. The order of the variables in the VAR statement determines 
their order in the first three reports and also corresponds to the 
numbering of the missing data flag variables (MFLAGn) generated by PROC 
IMPUTE. The processing time and storage requirements depend primarily on
the number of variables included in this statement. (See Appendix A for
examples.) The VAR statement ends with a semicolon. 

The Output Data Set 

In addition to the missing data reports described in the next section, 
PROC IMPUTE generates an output data set. The output data set is a
standard SAS systems file and includes each of the variables specified in
the VAR list plus a missing data flag variable for each of the variables
in the VAR list. The missing data flag variables have names MFLAG1 to
MFLAGn, where n is the number of variables processed. For example, the
statements "PROC IMPUTE; VAR X Y Z;" would produce an output file
containing six variables: X, Y, Z, MFLAG1, MFLAG2, and MFLAG3. MFLAG1
would be set to the value "I" for all records in which X was imputed, and
to the value " " (blank) for all records in which X was already on the file.
Similarly, MFLAG2 and MFLAG3 would indicate whether Y and Z were imputed
or actual values. The flags are character variables of length 1. These
flag variables may be given new names by attaching a RENAME statement to
the output data set specification in the IMPUTE statement. For example,
"PROC IMPUTE OUT=DSKOUT.MYFILE(RENAME=(MFLAG1=MVAR1 MFLAG2=MVAR2
MFLAG3=MVAR3));" would assign the names MVAR1, MVAR2, and MVAR3 to the
three missing data flags. (See the SAS manual for further information on
renaming variables.)

Not all variables on the input data set need be included in the VAR 
list for PROC IMPUTE; any variables not in the VAR list will not be on the 
output data set of PROC IMPUTE. To combine the imputed values with the
other variables not included in the VAR list, it is sufficient to execute
the following SAS MERGE:

    DATA MERGEDOUT;
    MERGE OLDFILE NEWFILE;

No "BY" statement is necessary because the file containing imputed values,
NEWFILE, is a record-by-record transformation of the original data set,
OLDFILE. If variables are imputed in blocks (e.g., 200 variables imputed
in four blocks of 50), a MERGE must be inserted after each call to PROC
IMPUTE if some variables imputed in each block are used in imputing 
variables in other blocks. 

Limitations of PROC IMPUTE

The following limitations apply to Version 1 of PROC IMPUTE. Some of 
these limitations will be removed in subsequent versions. 

1. Only numeric variables can be processed. Character variables must 
be recoded prior to PROC IMPUTE if they are to be imputed. 

2. Categorical variables are treated as if they were ordered in the
derivation of regression equations and subsets. This may not lead 
to an optimal set of predictors for these variables or to their 
optimal use in predicting other variables. It may be desirable to 
recode categorical variables into a series of dichotomous 
indicators prior to using PROC IMPUTE. 

For example, a school might be either "for girls only," "for
boys only," or "for both boys and girls," coded "1," "2," "3." In
this case, two dichotomies that might be useful in prediction
would be (1) to combine "for girls only" with "for boys only," as
opposed to coeducational and (2) to combine "for boys only" with
"for both boys and girls," as opposed to schools not for boys. In
this case, the original three-valued variable could easily be 
reconstructed from imputed values on the two dichotomies. Note 
that although it is theoretically possible for the imputation to 
produce conflicting values for the dichotomies, these cases should 
be very rare because no conflicts exist in the observed data and 
because one of the two dichotomies will almost certainly play a 
strong role in the imputation of the other. Nevertheless, the 
coding to reconstruct a categorical variable from dichotomies must 
handle possible conflicts.

3. Many survey instruments are designed so that certain items
determine whether other items are to be skipped or not (e.g., 
respondents who did not attend college are not asked to indicate a 
college major). This version of PROC IMPUTE does not include a 
provision for indicating that certain values are to remain missing 
after imputation. There are basically two methods for handling
"skip patterns" with the current version of PROC IMPUTE. (1) The
file containing imputed values can be re-edited to set appropriately
skipped items back to "missing." Alternatively, (2)
variables conditional on a particular item can be imputed in a 
separate block, after the conditioning item has been imputed, and 
only for the subfile of cases for which the variables should be 
imputed. Because it is less expensive to make a series of calls 
to PROC IMPUTE on small blocks of variables than a single call on 
a large number of variables, it is advisable to handle a complex
skip pattern through a series of calls to PROC IMPUTE on appropriate
subfiles. The SAS system greatly facilitates the file
manipulation (extraction of cases and later merging) needed for 
this. 

4. Case weights are not used in the estimation of the imputation
parameters. By including the weight variable in the variable
list, however, it is possible to eliminate any first-order (but
not interaction) effects associated with differential case weights.

5. While the number of variables processed by PROC IMPUTE is theore- 
tically unlimited, the storage and processing time requirements 
(i.e., costs) increase dramatically for larger numbers of 
variables (over 50 or so). 
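
Limitations 2 and 3 both call for some recoding around the call to PROC
IMPUTE. The Python sketch below illustrates the two recodings described
above: splitting the three-valued school variable into two dichotomies and
rebuilding it with a check for conflicts, and setting legitimately skipped
items back to missing. The function names, the variable names ATTCOLL and
MAJOR, the 0/1 coding of the dichotomies, and the use of None for
"missing" are assumptions made for the example. In practice this recoding
would be done with SAS statements.

```python
def to_dichotomies(code):
    """School type 1/2/3 -> (single_sex, admits_boys), each coded 0 or 1."""
    single_sex = 1 if code in (1, 2) else 0     # girls only or boys only
    admits_boys = 1 if code in (2, 3) else 0    # boys only or coeducational
    return single_sex, admits_boys


def from_dichotomies(single_sex, admits_boys):
    """Rebuild the 1/2/3 code; None signals a conflicting pair."""
    table = {(1, 0): 1, (1, 1): 2, (0, 1): 3}
    # (0, 0) would mean coeducational yet not for boys, which cannot occur
    # in reported data but could in principle arise from imputation.
    return table.get((single_sex, admits_boys))


def restore_skips(record, gate, skip_values, dependents):
    """Set items that should have been skipped back to missing (None)."""
    if record[gate] in skip_values:
        for name in dependents:
            record[name] = None
    return record


# Hypothetical record: no college attended, yet a major was imputed.
record = restore_skips({"ATTCOLL": 0, "MAJOR": 7}, "ATTCOLL", {0}, ["MAJOR"])
```

A conflicting pair is returned as None rather than forced into a category,
which leaves the decision on how to resolve it to the analyst.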




Reports Generated by PROC IMPUTE



Missing Data Report #1: Missing Data Frequencies and Univariate
Frequencies

Figure 5 shows an example of the first report generated by PROC 
IMPUTE. This report provides information on the amount of missing data 
and on the basic univariate characteristics of each variable. Specifi- 
cally, the following information is provided: 



Column   Description

1        Variable name. The variables are processed in the order
         specified by the VAR list. The order of the variables is
         important because the missing data flags are numbered in this
         order. If no VAR list is specified, all numeric variables are
         selected according to their position in the file.

2        The number of cases with missing values for this variable. Note
         that the specification of missing values is part of the work
         inherent in the creation of a SAS systems file. See Chapter 6 of
         the SAS Manual.

3        The percent of cases with missing values for this variable.

4        The number of cases with valid values for this variable.

5,6      The minimum and maximum reported values. Imputed values will
         always lie within the range of the reported values. When
         continuous or many-valued discrete variables are sliced into a
         smaller number of distinct levels, the minimum and maximum
         values are used as endpoints of the lowest and highest levels,
         respectively.*

7        The integer/decimal flag. During input, the reported values are
         checked to see whether any noninteger values are present. If all
         reported values are integers, the variable is flagged as
         "integer" and all imputed values will be integers. If any of the
         reported values are nonintegers, the variable is flagged as
         "decimal" and noninteger values will be imputed.

8        The number of distinct levels to be used in approximating
         distributions for this variable. The program selects an optimal
         number of levels based on the number of cases with reported
         values. A greater number of levels is selected when more cases
         are available to use in estimation. If the variable is flagged
         as integer, however, and the actual range does not exceed twice
         the optimal number of values, then the number of integers in the
         range is used; otherwise, the optimal number of values is used.

9,10     The mean and standard deviation of the reported values. The
         statistics are used later to generate raw regression
         coefficients. They are presented here to allow users to check
         the reasonableness of their input and as a reference for later
         information.

11       Variable labels, if present, are shown to aid in the
         identification of each variable.

* In Version 1A of PROC IMPUTE, available in August 1980, the minimum for
many-valued discrete variables may print out as a very small positive
number instead of zero. This is of no real consequence, but will be
corrected in Version 2.
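
The rules behind columns 7 and 8 can be written out as a short Python
sketch. It is an illustration, not the PROC IMPUTE code: the function
names are invented, the "optimal" number of levels is taken as a given
input because this guidebook does not state how it is derived, and the
"actual range" is assumed to mean the maximum minus the minimum.

```python
def is_integer_valued(values):
    """Column 7: flagged "integer" only if every reported value is whole."""
    return all(float(v).is_integer() for v in values)


def number_of_levels(values, optimal):
    """Column 8: levels used to approximate the distribution."""
    if is_integer_valued(values):
        spread = max(values) - min(values)   # assumed meaning of "range"
        if spread <= 2 * optimal:
            return int(spread) + 1           # one level per integer in range
    return optimal


# Hypothetical reported values with an assumed optimal number of 5 levels.
few_integers = number_of_levels([1, 2, 3, 4, 5, 6, 7], optimal=5)
wide_integers = number_of_levels([0, 50, 100], optimal=5)
decimals = number_of_levels([1.5, 2, 3], optimal=5)
```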

Missing Data Report #2: Characteristics of Cases with Missing Values

Figure 6 shows an example of the second missing data report. This 
report summarizes the information that is available on cases with missing
values. Each row of this report focuses on cases with missing values for
a particular variable. Each column (except for diagonal cells) presents
information on the missing value cases with respect to a particular other
variable. For example, information in column 2 (PA5TSTAT) and row 1
(variable PA3TLADA) indicates how cases with and without PA3TLADA missing
differ in terms of PA5TSTAT.



The first two entries in each cell give the mean and standard devi- 
ation of the column variable for cases with missing values on the row 
variable. For example, the mean value of PA5TSTAT for cases missing 
PA3TLADA is 34.3032, compared to an overall mean (shown in the diagonal 
cell) of 34.2886. In general, the column variable will be present for 
only a portion of the cases that are missing the row variable. The third 
entry gives the number of cases missing the row variable but not the 
column variable. For example, 432 cases were missing PA3TLADA but not 
PA5TSTAT. The fourth entry is the phi coefficient describing the correla- 
tion of the presence of data on the row variable with the presence of data 
on the column variable. (Note: Under Version 1A, the phi coefficient is
incorrectly computed and should be ignored.) The fifth entry in each cell
gives a t-statistic measuring the extent of the column variable
mean difference between cases missing the row variable and the sample as a
whole. The sixth and final entry in each cell gives the significance of
this t-statistic. For example, the t-statistic comparing values of
PA5TSTAT between cases with and without values on PA3TLADA is 0.010. This
statistic is important for evaluation of the confidence that should be
placed in imputations. Where substantial differences exist, the
likelihood of deviation from the assumptions of the model is increased.
Therefore, variables with substantial differences on key variables (in the
estimation of the survey analyst) should be examined to evaluate (1)
whether their missing data are so frequent and (2) whether their
imputations are so poor, in terms of error variance, that the variables
should be deleted from further analyses.

Note (Figure 6). N/A indicates that the statistics could not be computed
because no data fit the constraint (presence of column variable, but
missing the row variable).
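
Because Version 1A prints an incorrect phi coefficient, an analyst who
needs this entry may wish to compute it separately. The Python sketch
below uses the standard definition of phi for a 2 x 2 table of presence
and absence. Whether a corrected PROC IMPUTE would compute it in exactly
this way is an assumption, and the function name, the variable names A and
B, and the use of None for a missing value are invented for the example.

```python
import math


def phi_presence(records, row, col):
    """Phi coefficient between presence of `row` and presence of `col`."""
    n11 = n10 = n01 = n00 = 0
    for r in records:
        row_present = r[row] is not None
        col_present = r[col] is not None
        if row_present and col_present:
            n11 += 1
        elif row_present:
            n10 += 1
        elif col_present:
            n01 += 1
        else:
            n00 += 1
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return None   # undefined when either variable is never (or always) missing
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical file in which A and B are always missing together.
together = [
    {"A": 1, "B": 1}, {"A": None, "B": None},
    {"A": 2, "B": 3}, {"A": None, "B": None},
]
phi_together = phi_presence(together, "A", "B")
```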

The first three entries in each cell can be compared with the
corresponding entries in the column's diagonal cell. The diagonal cells
give the means, standard deviations, and number of all reported values for
the column variables. Comparison of the values in each cell with the values
in the diagonal cell for that column indicates the extent to which the
cases missing the row variable differ from the sample as a whole, at least
insofar as that can be known given that the column variable may itself
have missing values.

The information presented in this report is helpful in understanding
the nature of the missing data in a particular survey. To the extent that
these results indicate more frequent nonresponse for particular types of
cases, it may be possible to modify future data collection procedures to
decrease the nonresponse and omit rates for these cases. (For example, if
the omit rate for some items were closely related to the respondent's
reading ability, it might be possible to decrease this omit rate by
simplifying the wording of these items.)

Missing Data Report #3: Correlations between Reported Values

Figure 7 shows an example of the third report generated by PROC
IMPUTE. This report shows the correlations between each pair of variables
based on all cases for which both variables are present. The number of
cases with both variables and the significance level of the correlations
are also printed. These correlations provide the basis for estimating
prediction equations for imputation. The information presented in this 
report is virtually identical to the information printed by the SAS 
routine PROC CORR. It is presented here to eliminate the need for a 
separate run of PROC CORR. (Version 2 of PROC IMPUTE will include an 
option for omitting this report if PROC CORR has already been run.) 

Missing Data Report #4: Regression Equations 

Figure 8 shows an example of the fourth report generated by PROC
IMPUTE. This report shows the regression equations used with each
variable to be imputed (target variable). Regression functions are
generated only for variables with some missing data. These variables are
ordered so as to maximize the total variance accounted for in the
prediction of all missing values when each variable is predicted only from
preceding variables. This ordering is necessary to ensure that any
missing values among the predictor values will have already been filled in
before the variable is used as a predictor in a regression function.
(Variables with no missing values are placed at the beginning of the list;
thus they precede all of the variables to be imputed.)

After an equation has been generated for each variable to be imputed,
each of the variables in the list is reexamined to see if its prediction
could be significantly improved by including "follower" variables in the
prediction equation. If so, a second equation is generated, and both
equations will appear in Report #4. The variable will then be imputed a
second time after an initial ("ghost") imputation has been performed for
each of the missing values.

The leftmost columns show the target variable for each equation and an
estimate of the squared multiple correlation, which is the proportion of
variance of the target variable accounted for by the predictor variables.
The actual variance accounted for may differ somewhat from the estimate
shown here because:

1. The multiple correlations are estimated from the pairwise
correlation coefficients, which are not all based on the same cases,
whereas the final parameter estimates are based on only those
cases with reported values for the target and all of the predictor
variables; multiple correlations calculated from correlations
based on different cases should not be interpreted as meaningful,
although they prove useful as a tool in accomplishing the
imputation;

2. The actual prediction is nonlinear and so may account for more
variation than a linear predictor function; and

3. The actual prediction uses discrete levels for the target variable
and discrete subsets based on the regression function values,
while the multiple r² shown in Report #4 is based on a
"continuous" predictor function.

Figure 8. PROC IMPUTE Report #4
Note. Variables in the right-hand columns are not included in the
regression [equation; they] are ordered, from left to right, in decreasing
order of partial covari[ance].

The second set of columns shows the predictor variables to be used and
the standardized and raw coefficients to be used with each predictor
variable. Only variables with significantly nonzero coefficients are
included in order to improve cross validation and computational
efficiency. The raw regression coefficients give regression function values
ranging from zero to the number of regression value subsets selected for
this variable. Thus, a simple rounding of the regression value gives the
index of the distribution to be used in the final imputation. As a
result, the raw regression coefficients do not necessarily yield a value
in the same units as the target variable.

The final set of columns shows each of the variables not in the
equation and a number (labeled PART COV) which, when squared, gives an 
estimate of the additional percentage of variance (rather than the 
percentage of additional variance) that would be accounted for if this 
variable were added to the equation. 

In the example, TA3ANOST (question A3 - how many students does the
teacher have) is predicted by PA3TLADA (the total ADA for the school
reported by the principal) and by TA1G912 and TA1G79 (whether this is a
junior high or high school teacher). The squared multiple correlation is
.410; 41% of the variance in TA3ANOST is accounted for by these
predictors. Of the variables not in the equation, TA3BNOCL (the number of
classes taught by this teacher) would improve the prediction the most,
explaining an additional 17% of the variance (.43²). The variable was
not included because it had yet to be imputed, so that its value might be
missing. In a second equation for TA3ANOST (not shown), the variable
TA3BNOCL was, in fact, included. The next equation shown
predicts TA3BNOCL from TA3ANOST with a squared multiple R of .726 (the
correlation between these two variables is .85).

The final equation in Figure 8 predicts TA1G13, whether the teacher
teaches grades 1 through 3. This is predicted by the number of classes
taught. The standardized regression coefficient is -.49, meaning that
teaching grades 1-3 is predicted by a low number of classes. Since there
is a single, discrete predictor, this case is handled a little
differently. The regression value subsets will correspond exactly to the
distinct values of the predictor variable. As a result the raw regression
coefficient has been set to 1 with the constant left undefined.

Missing Data Report #5: Conditional Distributions

Figure 9 shows an example of the fifth report generated by PROC 
IMPUTE. This report shows the cumulative distribution of each target 
value in each regression value subset. The first column of this report 
shows the regression subset number. 

The second column gives the number of cases with values for both the
target variable and the regression function. This is the number of cases 
used in estimating the target variable distribution for that subset. 

The third column shows the mean regression function value for this 
subset. This value is used in interpolating between subsets. In the 
first example predicting TA3BNOCL (number of classes), the first 
regression value subset included all cases with values below 1.0. The 
mean regression value of the 699 cases in this subset is .772. The second 
subset includes cases with regression values between 1.0 and 2.0. These 
245 cases have a mean regression value of 1.430. If a case for which 
TA3BNOCL was missing had a regression function value of 1.101, then the 
imputed value would be halfway between the value imputed from the subset 1 
distribution and the value imputed from the subset 2 distribution. More 
generally, if the regression function value is equal to p x (mean for 
interval i) + (1-p) x (mean for interval i+1), the imputed value would be 
p x (imputed value from distribution i) + (1-p) x (imputed value from 
distribution i+1). 

MISSING DATA REPORT #5: CONDITIONAL DISTRIBUTIONS 

Cumulative distributions correspond to frequency distributions such as 
shown in Figure 3. 

FIGURE 9. PROC IMPUTE Report #5 
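
The interpolation rule described above can be sketched in a few lines of 
Python. This is only an illustration under stated assumptions, not PROC 
IMPUTE's own code: the function names are hypothetical, and the two 
"imputed values" passed in at the end (1.0 and 2.0) are made-up draws from 
the adjacent subset distributions. The subset means (.772 and 1.430) and 
the regression value (1.101) are the ones quoted in the text. 

```python
def interpolation_weight(reg_value, mean_i, mean_next):
    """Return p such that reg_value = p*mean_i + (1-p)*mean_next."""
    return (mean_next - reg_value) / (mean_next - mean_i)


def interpolated_imputation(reg_value, mean_i, mean_next, value_i, value_next):
    """Blend the values imputed from two adjacent subset distributions."""
    p = interpolation_weight(reg_value, mean_i, mean_next)
    return p * value_i + (1 - p) * value_next


# Subset means from the text: .772 (subset 1) and 1.430 (subset 2).
p = interpolation_weight(1.101, 0.772, 1.430)
print(round(p, 3))  # 0.5, i.e. halfway between the two subsets

# value_i=1.0 and value_next=2.0 are hypothetical imputed values.
print(round(interpolated_imputation(1.101, 0.772, 1.430, 1.0, 2.0), 3))  # 1.5
```

A weight of p = 1 reproduces the value from distribution i, and p = 0 
reproduces the value from distribution i+1. 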

The fourth and fifth columns of this report show a mean and standard 
deviation for the target variable for each regression value subset. For 
continuous variables, the values shown are the mean and standard deviation 
of the (integer-valued) level number rather than of the variable itself. 
In the first example, a discrete variable, the 699 teachers in the first 
regression value subset taught an average of 1.199 classes while the 7 
teachers in the ninth subset taught an average of 7.000 classes. 

The remaining columns in this report show the proportion of cases in 
each subset that have target variable values at or below the indicated 
level. (The highest level is omitted since all of the cases are at or 
below this level.) In the example, 92.1% of the teachers in subset 1 
taught only one class and all 699 teachers in this subset taught six or 
fewer classes. The second target variable in the example, teaching grades 
1-3, is dichotomous. The numbers in the rightmost column are the 
proportions in each subset with a value of zero (the proportion not 
teaching grades 1-3). Recall from Figure 8 that the single predictor 
variable is number of classes taught. Here 40.7% of those teaching one 
class teach other than grades 1-3, while over 90% of those teaching four 
to six classes teach other than grades 1-3. 
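
To read these cumulative columns as ordinary frequencies, successive 
differences can be taken. The Python sketch below is an illustration only; 
the function name is hypothetical. The value .407 is the proportion quoted 
above for teachers with one class; the multi-level list in the second call 
is made up except for its first entry (.921). 

```python
def level_proportions(cumulative):
    """Convert 'at or below' proportions into per-level proportions.

    `cumulative` omits the highest level, as the report does, so 1.0 is
    appended here before differencing.
    """
    full = list(cumulative) + [1.0]
    proportions = []
    previous = 0.0
    for c in full:
        proportions.append(round(c - previous, 6))
        previous = c
    return proportions


# Dichotomous target: .407 have value 0 (not teaching grades 1-3),
# so the remainder have value 1 (teaching grades 1-3).
print(level_proportions([0.407]))

# Multi-level target: only .921 is from the text; the rest is hypothetical.
print(level_proportions([0.921, 0.960, 0.980, 0.990, 0.995]))
```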

The row at the bottom of each table in this report shows the overall 
mean and standard deviation of the target variable (in integer level 
units) and the average standard deviation within each subset. The R SQ 
measure (actually an eta squared since the relationship may not be linear) 
indicates the reduction in variance due to the differences between subsets. 
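
As an illustration of the R SQ measure, the following Python sketch 
computes eta squared with the standard definition, one minus the ratio of 
the within-subset sum of squares to the total sum of squares. Whether PROC 
IMPUTE uses exactly this formula is an assumption; the function name and 
the data are hypothetical. 

```python
def eta_squared(groups):
    """Return 1 - (within-subset sum of squares / total sum of squares).

    `groups` is a list of lists, one list of target values per subset.
    Assumes the target values are not all identical (total SS > 0).
    """
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        ss_within += sum((v - group_mean) ** 2 for v in g)
    return 1.0 - ss_within / ss_total


# Hypothetical level numbers for three regression value subsets.
subsets = [[1, 1, 2], [3, 4, 4], [6, 7, 7]]
print(round(eta_squared(subsets), 3))
```

When every subset contains a single repeated value, the within-subset sum 
of squares is zero and the measure equals 1. 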
The final report generated by PROC IMPUTE is not yet fully developed. 
It is designed to show the error variation in each variable due to missing 
values. Each time a variable is imputed, the target variable variance for 
the appropriate regression value subset is added to a total for that 
variable. There is no missing data error variation for nonmissing values, 
so nothing is added to the totals for these cases. The final totals are 
then divided by the total number of cases to give an average missing data 
error variance for each variable. If no cases were missing values, then 
the average will be zero. Similarly, if all missing values are imputed 
with certainty (the within-subset variances were all zero), then the final 
average error would be zero. 
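
The accumulation described above can be sketched as follows. This is a 
minimal Python illustration of the stated rule, not the procedure's actual 
code; the function name, the subset variances, and the case list are all 
hypothetical. 

```python
def average_error_variance(case_subsets, subset_variances):
    """Average missing-data error variance for one variable.

    case_subsets: one entry per case; None if the value was observed,
        otherwise the regression value subset used to impute it.
    subset_variances: target variable variance within each subset.
    """
    total = 0.0
    for subset in case_subsets:
        if subset is not None:          # only imputed cases add variance
            total += subset_variances[subset]
    return total / len(case_subsets)    # divide by ALL cases


variances = {1: 0.25, 2: 1.0}           # hypothetical within-subset variances
cases = [None, None, 1, 2, None]        # two of five cases were imputed
print(average_error_variance(cases, variances))  # 0.25
```

With no imputed cases, or with all within-subset variances equal to zero, 
the function returns zero, matching the two limiting cases in the text. 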

The final error variance estimates are printed at the end of Report 
#5. A number is given for each variable in the VAR list, and the numbers 
are in order of the variable's position in this list. The variances shown 
are currently in integer level units and must be referenced to the Total 
S.D. in Report #5. This measure can be used to assess the random 
component of the error due to imputing a value rather than collecting real 
data. An R SQ less than .25, for a target variable with substantial 
missing data and for which nonrespondents differ significantly from 
respondents (Report #2), indicates generally poor imputation of the target 
variable. 
Processing Time by Numbers of Variables and Cases 
for BMDPAM and PROC IMPUTE* 

Number of Variables 



10 2D _iQ_ _JL_ -J2_ 

mber of Cases MM WSL MM IHEUIE ME UfliK HfUE 

100 .9 

500 3.2 2.0 7.0 4.6 8.3 

1000 11.8 8.2 

1179 37.5 
1994 105.2 

*Processing time is in CPU seconds for an IBM 370/168 in an MVS 
environment. The BMDPAM runs used the REGR option. 
APPENDIX B 

Sample SAS Program 
to Reweight for Total Nonresponse 



DATA TEMP1; 
   SET DDNAME1.YOURFILE; 
   * INCLUDE ANY CODE NEEDED TO SET THE CELL, NONRESP, AND WEIGHT; 
   * VARIABLES, AND TO COMBINE CELLS AS NEEDED; 
   * NOTE: CELL    = UNIQUE VALUE FOR EACH CELL; 
   *       NONRESP = 1 FOR NONRESPONDENTS, 0 OTHERWISE; 
   *       WEIGHT  = CASE WEIGHT TO BE RESET; 

PROC SORT; BY CELL NONRESP; 

* SUM THE WEIGHTS OF RESPONDENTS AND NONRESPONDENTS IN EACH CELL; 
PROC MEANS; BY CELL NONRESP; 
   VAR WEIGHT; 
   OUTPUT OUT = TEMP2 
          SUM = SUMWT; 

* ONE ADJUSTMENT FACTOR PER CELL. WITHIN A CELL THE RESPONDENT ROW; 
* (NONRESP=0) SORTS BEFORE THE NONRESPONDENT ROW (NONRESP=1); 
DATA WTADJS; SET TEMP2; BY CELL; 
   RETAIN SUMRESP 0; 
   IF LAST.CELL THEN GO TO SETWT; 
   SUMRESP = SUMWT; DELETE; 
SETWT: 
   IF NONRESP EQ 0 THEN DO;   * NO NONRESPONDENTS IN THIS CELL; 
      WTADJ = 1; RETURN; 
   END; 
   IF SUMRESP LE 0 THEN GO TO CELLERR; 
   WTADJ = (SUMWT + SUMRESP) / SUMRESP; 
   SUMRESP = 0; 
   KEEP CELL WTADJ; RETURN; 
CELLERR: 
   PUT CELL 'HAS NO RESPONDENTS BUT HAS ' 
       SUMWT 'WEIGHTED NONRESPONDENTS'; 
PROC PRINT; * TO PRINT WEIGHT ADJUSTMENTS; 

DATA DDNAME2.NEWFILE; 
   MERGE TEMP1 WTADJS; BY CELL; 
   IF NONRESP EQ 1 THEN DELETE; 
   WEIGHT = WEIGHT * WTADJ; 
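
For readers without access to SAS, the same weighting-class adjustment can 
be sketched in Python. This is an illustrative equivalent, not part of the 
original program: the function name and record layout (dictionaries with 
'cell', 'nonresp', and 'weight' keys) are assumptions, and case weights 
are assumed to be positive. Each respondent's weight is multiplied by 
(nonrespondent weight + respondent weight) / (respondent weight) for its 
cell, and nonrespondents are dropped. 

```python
from collections import defaultdict


def reweight(records):
    """Return (adjusted respondent records, cells with no respondents)."""
    sum_resp = defaultdict(float)
    sum_nonresp = defaultdict(float)
    for r in records:
        if r["nonresp"] == 1:
            sum_nonresp[r["cell"]] += r["weight"]
        else:
            sum_resp[r["cell"]] += r["weight"]

    adjusted = []
    for r in records:
        if r["nonresp"] == 1:
            continue                      # nonrespondents are dropped
        cell = r["cell"]
        factor = (sum_nonresp.get(cell, 0.0) + sum_resp[cell]) / sum_resp[cell]
        adjusted.append({**r, "weight": r["weight"] * factor})

    # Cells whose nonrespondent weight cannot be reassigned to anyone.
    empty = sorted(c for c in sum_nonresp if sum_resp.get(c, 0.0) <= 0)
    return adjusted, empty


# Hypothetical data: cell A has two respondents and one nonrespondent;
# cell B has a nonrespondent only.
data = [
    {"cell": "A", "nonresp": 0, "weight": 10.0},
    {"cell": "A", "nonresp": 0, "weight": 10.0},
    {"cell": "A", "nonresp": 1, "weight": 5.0},
    {"cell": "B", "nonresp": 1, "weight": 3.0},
]
new_records, cells_without_respondents = reweight(data)
print([r["weight"] for r in new_records])   # [12.5, 12.5]
print(cells_without_respondents)            # ['B']
```

The total weight in cell A (25) is preserved after the adjustment, which 
is the purpose of reweighting for total nonresponse. Cells reported as 
having no respondents would need to be combined with other cells before 
the adjustment is applied. 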